Lightweight bobbin yarn detection model for auto-coner with yarn bank

The automated replacement of empty tubes in the yarn bank is a critical step in the operation of automatic winding machines with yarn banks, as the real-time detection of depleted yarn on spools and the accurate positioning of empty tubes directly affect the production efficiency of winding machines. To address the shortcomings of traditional optical and visual tube detection methods, such as poor adaptability and low sensitivity, and to reduce the computational and detection time costs introduced by neural networks, this paper proposes a lightweight yarn spool detection model based on YOLOv8. The model uses Darknet-53 as the backbone network and, because yarn spool targets are densely distributed in space, incorporates large selective kernel units to enhance the recognition and positioning of dense targets. To address the excessive focus of convolutional neural networks on local features, a bi-level routing attention mechanism is introduced to capture long-distance dependencies dynamically. Furthermore, to balance accuracy and detection speed, a FasterNeck is constructed as the neck network, replacing the original convolutional blocks with Ghost convolutions and integrating FasterNet. This design minimizes the sacrifice of detection accuracy while achieving a significant improvement in inference speed. Lastly, the model employs weighted IoU with a dynamic focusing mechanism as the bounding box loss function. Experimental results on a custom yarn spool dataset demonstrate a notable improvement over the baseline model, with a high-confidence mAP of 94.2% and a compact weight size of only 4.9 MB. The detection speed reaches 223 FPS, meeting the requirements for industrial deployment and real-time detection.


Large selective kernel unit
LSKNet (large selective kernel network), proposed by Yuxuan Li and colleagues, is designed to address challenges in remote sensing image detection. The model focuses on the background regions most relevant to the target under examination by decomposing and selecting large convolutional kernels. The LSKUnit, the foundational module of LSKNet, is shown in Fig. 2. The input features of the module are convolved with multiple decomposed large convolutional kernels, yielding feature maps that correspond to different large receptive fields. These feature maps are then concatenated along the channel dimension, and average pooling and max pooling layers are applied to extract spatial feature information. In the mathematical expression for this operation, Kernel_1(F_in) denotes the convolution of the input feature F_in with the first convolutional kernel, and AvgPool and MaxPool denote average pooling and max pooling, respectively.
After the spatial features are obtained, a convolutional layer followed by a Sigmoid activation function fuses the different spatial features and lets them interact, generating the attention weights of the module. These weights are then applied to the features output by the large convolutional kernels, producing the final output features of the module. In the mathematical expression for this operation, Conv denotes the Conv2D module and Sigmoid denotes the sigmoid activation function.
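The two steps described above can be written compactly with the symbols the text defines. The intermediate names $U_1$, $U_2$, $SA$, $w_1$, $w_2$ and $F_{out}$ are introduced here for illustration only, and the arrangement (two kernels applied in series, weights applied as an element-wise product) is an assumption drawn from the description, not a reproduction of the original equations.

```latex
U_1 = \mathrm{Kernel}_1(F_{in}), \qquad U_2 = \mathrm{Kernel}_2(U_1)

SA = \mathrm{Concat}\big(\mathrm{AvgPool}(\mathrm{Concat}(U_1, U_2)),\;
                         \mathrm{MaxPool}(\mathrm{Concat}(U_1, U_2))\big)

[\,w_1,\; w_2\,] = \mathrm{Sigmoid}\big(\mathrm{Conv}(SA)\big)

F_{out} = w_1 \odot U_1 + w_2 \odot U_2
```

Under this reading, each $w_i$ is a spatial map, so every location chooses its own mix of the smaller and the larger receptive field.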
With the adaptive kernel adjustment of the LSKUnit, the input features pass through two large convolutional kernels in series, achieving dynamic receptive field adjustment. This design enables the LSKUnit to adaptively extract features of the bobbin at different scales. Compared with a single larger convolutional kernel, the serial approach with multiple large kernels offers advantages in both computational speed and parameter efficiency. Additionally, the use of multiple convolutional kernels of different sizes allows the module to generate spatial features with multi-scale and multi-receptive-field characteristics, enriching the target features and enhancing the backbone network's ability to focus on background features at a lower computational cost. The schematic diagram of the LSKUnit is shown in Fig. 3.
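The parameter advantage of the serial arrangement can be checked with a little arithmetic. The sketch below assumes a decomposition into a 5×5 depthwise kernel followed by a 7×7 depthwise kernel with dilation 3; these sizes are illustrative and are not stated in this section. It compares the stack with a single depthwise kernel covering the same receptive field.

```python
def serial_receptive_field(kernels):
    """Receptive field of stride-1 convolutions applied one after another.

    kernels: list of (kernel_size, dilation) pairs in application order.
    """
    rf = 1
    for k, d in kernels:
        rf += (k - 1) * d
    return rf


def depthwise_weights_per_channel(kernels):
    """Spatial weights per channel for a stack of depthwise kernels."""
    return sum(k * k for k, _ in kernels)


# Assumed decomposition (hypothetical sizes, not taken from the text).
decomposed = [(5, 1), (7, 3)]

rf = serial_receptive_field(decomposed)
serial_cost = depthwise_weights_per_channel(decomposed)
single_cost = depthwise_weights_per_channel([(rf, 1)])

print(f"receptive field: {rf}x{rf}")
print(f"weights per channel: serial={serial_cost}, single kernel={single_cost}")
```

Under these assumed sizes the stack reaches a 23×23 receptive field with 74 weights per channel, where a single 23×23 depthwise kernel would need 529.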

Biformer
Compared to convolutional neural networks, the core advantage of Transformers lies in their ability to capture long-range dependencies through self-attention mechanisms. While this structure can enhance model performance to a certain extent, traditional self-attention modules consume a significant amount of memory and entail high computational costs.
To address this issue, Lei Zhu et al. 32 proposed the Biformer model. This model reduces computational costs by introducing a two-level dynamic sparse attention mechanism, known as BRA. The key idea is to first apply coarse-grained filtering to the input features, ignoring most of the irrelevant key-value pairs, and then perform fine-grained token-to-token attention on the remaining regions. The module structure diagram is shown in Fig. 4.
The Biformer module follows an architecture similar to many Transformer models, consisting of a self-attention unit and a multilayer perceptron unit. Unlike other ViT models, Biformer introduces a novel self-attention unit called BRA (Bi-level Routing Attention). In the BRA module, the input features with height H, width W, and C channels are divided into S × S regions. Each region is then processed with learnable weight matrices W_q, W_k, and W_v to obtain three tensors: Q (queries), K (keys), and V (values). The calculation formula is as follows, where X denotes the features of a region:

$$Q = XW_q, \qquad K = XW_k, \qquad V = XW_v$$

The calculation formula for the self-attention mechanism is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{C}}\right)V$$

The constant C in the equation is introduced as a normalization term for gradient stability. From the above formula, it can be observed that the self-attention mechanism performs a weighted operation on the V matrix. The weight values QK^T measure the correlation between the features of other regions and the selected region Q. A top-k algorithm is applied to select the k regions that exhibit the highest correlation with the currently selected region Q, achieving coarse-grained key-value filtering.
Subsequently, the key-value pairs of the k regions most relevant to the currently selected region are merged, and fine-grained token-to-token attention is computed on the merged pairs. Introducing Biformer enhances the model's ability to capture long-distance dependencies and strengthens the network's awareness of contextual features. Additionally, because Bi-level Routing Attention has lower computational complexity and faster attention operations than other self-attention mechanisms, a better balance is achieved between model accuracy and detection speed. The schematic diagram of Bi-level Routing Attention is shown in Fig. 5.
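The coarse-then-fine procedure described above can be sketched in plain Python on toy data. This is a minimal illustration, not the Biformer implementation: it assumes that each region is summarised by the mean of its token vectors before routing, and all function names (`route_regions`, `token_attention`, `bra`) are hypothetical.

```python
import math


def mean_vector(tokens):
    """Average a list of equal-length token vectors into one region descriptor."""
    n = len(tokens)
    return [sum(t[c] for t in tokens) / n for c in range(len(tokens[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def route_regions(q_regions, k_regions, k):
    """Coarse step: for each query region, keep the k most related key regions."""
    q_desc = [mean_vector(r) for r in q_regions]
    k_desc = [mean_vector(r) for r in k_regions]
    routes = []
    for qd in q_desc:
        scores = [dot(qd, kd) for kd in k_desc]
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        routes.append(order[:k])
    return routes


def token_attention(q, keys, values):
    """Fine step: softmax(q K^T / sqrt(C)) V for a single query token."""
    c = len(q)
    logits = [dot(q, key) / math.sqrt(c) for key in keys]
    m = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def bra(q_regions, k_regions, v_regions, k):
    """Route each region to its top-k regions, then attend over their tokens only."""
    routes = route_regions(q_regions, k_regions, k)
    out = []
    for idx, region in enumerate(q_regions):
        keys = [t for j in routes[idx] for t in k_regions[j]]
        vals = [t for j in routes[idx] for t in v_regions[j]]
        out.append([token_attention(q, keys, vals) for q in region])
    return routes, out


# Three toy regions with two tokens each and C = 2 channels; Q = K = V here.
regions = [
    [[2.0, 0.0], [2.0, 0.0]],
    [[0.0, 2.0], [0.0, 2.0]],
    [[1.0, 1.5], [1.0, 1.5]],
]
routes, output = bra(regions, regions, regions, k=2)
print(routes)
```

Each query token attends to the tokens of only k regions instead of all S × S regions, which is where the saving over full self-attention comes from.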

Lightweight improvement of neck
The advancement of deep neural networks has propelled significant progress in various computer vision tasks, leading to higher accuracy in areas such as object detection and image recognition. However, this progress is accompanied by larger models, deeper network structures, and increased computational costs. For less complex downstream computer vision tasks, achieving a balance between accuracy and speed is crucial to applying the technology effectively in practical scenarios.
GhostConv, introduced by Han 9 and others, is based on the observation that the feature maps extracted by the ResNet-50 backbone often include many highly similar feature maps, referred to as Ghost pairs. The researchers found a positive correlation between these Ghost pairs and the model's feature extraction capability. They therefore proposed using a simple linear transformation to generate multiple sets of Ghost pairs, enhancing the model's feature extraction ability while reducing the computational cost of convolution. This is the core idea behind GhostNet. In the module, the simple linear transformation is implemented with DWConv (Depthwise Convolution). A schematic of GhostConv and a small sample of Ghost pairs are shown in Fig. 6a,b.
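The saving from replacing part of a convolution with cheap depthwise operations can be estimated by counting weights. The sketch below is a rough count under stated assumptions: half of the output channels come from an ordinary convolution (`ratio=2`) and the other half from a 5×5 depthwise kernel (`cheap_k=5`). Both values, and the function names, are illustrative and are not taken from the text.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a 2D convolution without bias."""
    return (c_in // groups) * c_out * k * k


def ghost_conv_params(c_in, c_out, k, ratio=2, cheap_k=5):
    """Weights of a Ghost-style block: a primary conv plus cheap depthwise convs.

    ratio and cheap_k are illustrative assumptions, not values from the text.
    """
    primary_out = c_out // ratio
    primary = conv_params(c_in, primary_out, k)
    # Each primary feature map spawns (ratio - 1) ghost maps via a depthwise kernel.
    ghost_out = primary_out * (ratio - 1)
    cheap = conv_params(primary_out, ghost_out, cheap_k, groups=primary_out)
    return primary + cheap


standard = conv_params(128, 256, 3)
ghost = ghost_conv_params(128, 256, 3)
print(f"standard conv: {standard} weights")
print(f"ghost conv:    {ghost} weights ({standard / ghost:.2f}x fewer)")
```

With these assumed settings the weight count is roughly halved, because the depthwise part adds only a few thousand weights on top of the half-width primary convolution.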
FasterBlock was introduced by Jierun Chen and collaborators in response to the limitations of classic lightweight convolutions such as DWConv, which primarily reduce FLOPs (floating-point operations) without effectively improving FLOPS (floating-point operations per second). Their experiments showed that the frequent memory access of DWConv prevents a substantial speed-up on actual processing units (CPU or GPU). To address this issue, they developed PConv (Partial Convolution) and used it as the foundation of FasterNet. PConv convolves only a subset of the channels of the input feature, which significantly reduces both computation and memory access compared with a full convolution. However, this may neglect information from the remaining input channels. To mitigate this, a point-wise convolution (PWConv) is applied after PConv, making use of the other channels of the input feature without incurring excessive computational cost. The FasterBlock structure is shown in Fig. 6c.

To avoid excessive focus on low-quality training data, this paper utilizes WIoU (Weighted IoU) as the bounding box regression loss function. WIoU incorporates a non-monotonic dynamic focusing mechanism. In its formulation, β is defined as the outlier degree of the current data. A larger value indicates lower data quality, making the sample more challenging to detect, while a smaller value indicates higher quality, making it easier to detect. γ is a non-monotonic dynamic focusing coefficient, and δ and α are hyperparameters. When β equals δ, γ equals 1, meaning that the loss of that training sample is not reweighted. Since the mean IoU loss is an average over the data involved in training and is updated as training proceeds, γ changes dynamically with the progress of training. In the WIoU formula, (x, y) represents the center position of the predicted bounding box, and (x_gt, y_gt) represents the center position of the ground truth bounding box. The loss function is composed of a non-monotonic dynamic focusing coefficient, a center distance penalty mechanism, and the original IoU loss function.
Compared to the original CIoU in the model, WIoU can dynamically adjust the loss weights for samples of different detection difficulty during training. This enables the model to focus more on moderately challenging training data, thereby improving the model's robustness while reducing the risk of overfitting.
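The three components named above can be put together in a small numerical sketch. It assumes the Wise-IoU style formulation, with γ = β / (δ·α^(β−δ)) and a center-distance penalty normalised by the width and height of the smallest box enclosing both boxes. The enclosing-box normalisation and all function names are assumptions made for illustration, and the sketch works on plain floats and so ignores the gradient detaching a training framework would need.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def focusing_coefficient(beta, delta=3.0, alpha=1.9):
    """Non-monotonic weight: equals 1 at beta == delta, small for tiny or huge beta."""
    return beta / (delta * alpha ** (beta - delta))


def wiou_loss(pred, gt, mean_iou_loss, delta=3.0, alpha=1.9):
    """Focusing coefficient x center-distance penalty x IoU loss (assumed form)."""
    loss_iou = 1.0 - iou(pred, gt)
    px, py = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    # Width and height of the smallest box enclosing both boxes (assumption).
    wg = max(pred[2], gt[2]) - min(pred[0], gt[0])
    hg = max(pred[3], gt[3]) - min(pred[1], gt[1])
    penalty = math.exp(((px - gx) ** 2 + (py - gy) ** 2) / (wg ** 2 + hg ** 2))
    beta = loss_iou / mean_iou_loss  # outlier degree of this box
    return focusing_coefficient(beta, delta, alpha) * penalty * loss_iou


for beta in (0.2, 1.5, 3.0, 6.0):
    print(f"beta={beta:>3}: gamma={focusing_coefficient(beta):.3f}")
print(wiou_loss((0, 0, 2, 2), (1, 0, 3, 2), mean_iou_loss=0.5))
```

The printed coefficients rise and then fall as β grows, which is the non-monotonic behaviour the text relies on: very easy and very poor boxes both receive less weight than moderately difficult ones.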

Experimental platform and dataset
The software and hardware information of the experimental platform is shown in Table 1. The dataset used in the experiments comprises two parts: images collected from idle yarn storage on a textile factory's production line, and directly captured images of yarn tubes and yarn storage. After basic image augmentations such as horizontal and vertical flipping, a total of 628 images were obtained. The dataset contains approximately 12,000 annotated targets in two categories: tubes with yarn and tubes without yarn. The dataset was partitioned into training, validation, and test sets in a ratio of 8:1:1. The labeling of the targets is shown in Fig. 7.

To ensure the effectiveness and rigor of the experimental result analysis, the study employed precision (P), recall (R), and mean average precision (mAP) as the primary evaluation metrics for the models. In the comparison of models with different structures, model size and inference frame rate were used as additional reference metrics. Precision and recall are computed as

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

where TP is the number of true positive samples detected, FP is the number of false positive samples (negatives misclassified as positive), TN is the number of true negative samples detected as negative, FN is the number of false negative samples (missed positive samples), AC is the overall accuracy across all classes, and C is the number of target categories.
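A short sketch shows how these metrics are computed from counts. Precision and recall follow the definitions above; the average precision here uses all-point interpolation of the precision-recall curve, which is a common convention and an assumption about the exact procedure used in the experiments. The function names are illustrative.

```python
def precision(tp, fp):
    """Fraction of detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of ground-truth targets that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation).

    recalls must be sorted in ascending order, with matching precisions.
    """
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))


def mean_average_precision(ap_per_class):
    """Mean of the per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)


# Thresholds averaged over for mAP@0.5-0.95 (0.5 to 0.95 in steps of 0.05).
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

print(precision(tp=90, fp=10), recall(tp=90, fn=30))
print(average_precision([0.5, 1.0], [1.0, 0.5]))
print(THRESHOLDS)
```

With two classes, as in this dataset, `mean_average_precision` would simply average the AP of the two categories at each threshold.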

Experiment on network structure improvement
Given the considerations of industrial application costs and the difficulty of the empty bobbin visual detection task, this paper uses the YOLOv8 nano model as the baseline for the comparative experiments on network structure improvement. Compared to other YOLOv8 baseline models, the nano model has the advantages of small size, fast inference speed, and low computational cost, which makes it one of the most suitable models for deployment in industrial real-time detection. However, due to its minimal network depth and width, its accuracy is not as high as that of the other YOLOv8 baseline models. The performance of YOLOv8n on the bobbin dataset is shown in Table 2. From these experimental results, it can be observed that the model achieves a high level of average accuracy at a confidence threshold of 0.5. However, as the confidence threshold increases, the precision decreases significantly, indicating the occurrence of false positives or false negatives within the high-confidence interval. Additionally, there is substantial room for improvement in the model's detection speed.

From Table 3, it can be observed that the LSK attention mechanism and Biformer perform well on the dataset used in this study. The data from Experiment Groups 1 and 2 indicate that the introduction of LSKUnit improves the model's average accuracy by 0.4%, and the Biformer module further enhances the average accuracy by 0.1%.
When the above modules are combined with the C2f module, replacing the original Bottleneck module with the new attention mechanism module and allowing the model to receive attention-weighted multi-gradient information, the model achieves local feature fusion at minimal computational cost. This enhances the accuracy of the model while maintaining its lightweight nature. Experimental data from Groups 3 and 4 show that, with the introduction of the C2f-LSK module and the C2f-BRA module, the baseline model's mAP is improved by 0.4% and 0.5%, respectively, with varying degrees of improvement in detection speed. In Experiment 5, the C2f-LSK module and the C2f-BRA module were combined and applied to the backbone of the baseline model. The C2f-LSK module, through its mechanism of dynamically adjusting the receptive field, significantly improved the model's performance in detecting dense targets, enhancing both detection accuracy and speed. The C2f-BRA module, by enhancing the model's ability to capture global information and integrate contextual features, further enriched the target feature representation, improving the model's performance in complex backgrounds. Experimental data show that the average accuracy of the combined model improved by 0.4%, and the detection speed increased to 230 FPS.
The heatmaps in Fig. 8 illustrate the impact of different attention mechanisms on feature extraction. The colors in the heatmaps indicate the model's attention to different regions, with red areas representing higher attention and blue areas representing lower attention. The heatmaps show that the baseline model focuses primarily on target regions but has some omissions in feature extraction within complex backgrounds, affecting detection accuracy at high confidence levels. In contrast, the C2f-LSK module's heatmap shows more uniform attention across regions, particularly with more detailed feature extraction at the edges of target objects, enhancing the model's performance in complex backgrounds. The C2f-BRA module increases attention to target regions, enriches feature extraction details, and further improves background feature extraction. The combination of the C2f-LSK and C2f-BRA modules leverages the strengths of both, exhibiting multi-scale and multi-receptive-field characteristics in the heatmaps, making target features more prominent and background feature extraction more comprehensive. This combination leads to an overall performance improvement. These experimental results support the effectiveness of the proposed structure: the improved C2f-LSK and C2f-BRA modules enhance the model's average accuracy on the bobbin dataset without compromising detection speed.

Therefore, this paper selects the improved C2f-LSK module to enhance the model's feature perception capability for dense yarns. Through its multi-scale large kernel selection mechanism, the module enables the model to focus on both the foreground and background of the detected target, reducing false positives and false negatives. At the same time, the introduction of the C2f-BRA module prevents the model from overly focusing on local information and strengthens the model's perception of global information.
In addition, for the textile industry, achieving a good balance between detection efficiency, accuracy, robustness, and lightweight characteristics is crucial. If any one of these factors is limited, it becomes challenging to apply the model to actual industrial production. Therefore, to balance production costs and detection

Table 2. Performance of the baseline model YOLOv8n on the dataset used in this study. Columns: Class, P, R, mAP@0.5, mAP@0.5-0.95.
In this paper, unless otherwise specified, mAP@0.5 refers to the model's average precision at a confidence threshold of 0.5, and mAP@0.5-0.95 refers to the average precision across different confidence thresholds ranging from 0.5 to 0.95 with a step size of 0.05.

WIoU, as a bounding box regression loss function with a non-monotonic dynamic focusing property, relies on the non-monotonic dynamic focusing mechanism, which is primarily governed by the focusing coefficient γ (Eq. 12). The dynamic variation of γ is achieved by defining the outlier degree β as the ratio of the current anchor box's IoU loss to the average IoU loss of the anchor boxes (Eq. 12). During training, γ is updated in real time to achieve dynamic changes. The shape of the coefficient's function curve varies with the hyperparameters δ and α. The curve of the coefficient function is illustrated in Fig. 9.

A smaller outlier degree β indicates higher quality for the current bounding box, implying excellent training data. In such cases, the aim is to prevent the model from overly focusing on it. Therefore, assigning it a small gradient weight through the focusing coefficient γ helps enhance the model's generalization capability. On the other hand, when the outlier degree β is large, indicating poor quality for the current bounding box (low-quality training data), it is assigned a very low gradient weight to avoid harmful gradients from low-quality data. Training data of moderate quality are assigned a high gradient weight through the coefficient γ, so that the model focuses on moderate-quality training data and improves overall performance.
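Where the curve peaks can be worked out directly. The derivation below assumes the focusing coefficient has the form $\gamma = \beta / (\delta\,\alpha^{\beta-\delta})$, which agrees with the statement that $\gamma = 1$ when $\beta = \delta$ but is not spelled out in this section; $\beta^{*}$ is a name introduced here for the location of the maximum.

```latex
\gamma(\beta) = \frac{\beta}{\delta\,\alpha^{\beta-\delta}}, \qquad
\frac{d\gamma}{d\beta} = \frac{1 - \beta\ln\alpha}{\delta\,\alpha^{\beta-\delta}} = 0
\;\Longrightarrow\;
\beta^{*} = \frac{1}{\ln\alpha}

\gamma(\beta^{*}) = \frac{\alpha^{\delta}}{e\,\delta\,\ln\alpha}
```

Under this form, α alone sets where the weight is largest and δ sets where it crosses 1. For example, α = 1.9 and δ = 3 give β* ≈ 1.56 and a peak weight of about 1.31, so boxes whose IoU loss is roughly one and a half times the average receive the most emphasis.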
From the figure above, it is evident that different choices of hyperparameters have a significant impact on the gradient weights assigned to training data. To identify the most suitable hyperparameters for the yarn dataset, we conducted a controlled experiment. The experimental data are shown in Table 5. The results of the WIoU hyperparameter selection indicate that WIoU performs best on the yarn dataset when the hyperparameters are chosen as δ = 3 and α = 1.9. As shown in Table 6, in the subsequent comparative experiment, models using WIoU achieved a 0.2% increase in average accuracy compared to models using CIoU. This confirms that, in contrast to the original CIoU, WIoU is better suited for the target detection task on the yarn dataset.

From the experimental data of Experiment Group 1 and Experiment Group 2, it is evident that the introduction of the improved C2f-LSK module and C2f-BRA module can effectively enhance the network's accuracy. However, the deepening of the model may exacerbate the loss of shallow features in the network. Additionally, the introduction of more convolutions causes the model to focus more on local information. Therefore, by introducing the C2f-BRA module with a self-attention mechanism alongside the C2f-LSK module, the model's attention to global information is enhanced. The data from Experiment Group 3 show that this approach is effective, resulting in a 0.4% increase in the model's average precision.
The results from Experiment Group 4 and Experiment Group 5 demonstrate that the C2f-Faster and GhostConv modules can significantly improve the model's detection speed without causing a loss in accuracy. Experiment 6, which combines both modules, shows that the lightweight improvement in the Neck structure leads to a noticeable reduction in parameter count and weight size, along with a significant increase in detection speed.
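The speed gain attributed to C2f-Faster is consistent with how few operations a partial convolution performs. The rough count below assumes that PConv convolves one quarter of the channels (`partial_ratio=0.25`) and is followed by a single point-wise convolution over all channels; the ratio, the feature-map size, and the function names are illustrative assumptions, not settings reported in the experiments.

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulate count of a dense k x k convolution (stride 1, same padding)."""
    return h * w * k * k * c_in * c_out


def pconv_macs(h, w, c, k, partial_ratio=0.25):
    """PConv: a dense convolution applied to only a fraction of the channels."""
    cp = int(c * partial_ratio)
    return conv_macs(h, w, cp, cp, k)


def pconv_pwconv_macs(h, w, c, k, partial_ratio=0.25):
    """PConv followed by a 1 x 1 point-wise convolution over all channels."""
    return pconv_macs(h, w, c, k, partial_ratio) + conv_macs(h, w, c, c, 1)


h = w = 40   # hypothetical feature-map size
c = 128      # hypothetical channel count
regular = conv_macs(h, w, c, c, 3)
partial = pconv_macs(h, w, c, 3)
combined = pconv_pwconv_macs(h, w, c, 3)

print(f"regular 3x3 conv: {regular:,} MACs")
print(f"PConv only:       {partial:,} MACs")
print(f"PConv + PWConv:   {combined:,} MACs")
```

Convolving a quarter of the channels costs one sixteenth of the full convolution, and even with the point-wise convolution added the total stays well below the dense 3×3 layer.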
Lastly, the data from Experiment 7 and Experiment 8 confirm the superiority of the proposed structure. With the introduction of the WIoU loss function, the model's accuracy improves by 0.8% compared to the baseline model. Additionally, the detection speed increases by 46 FPS, and there is a substantial reduction in model weight size and parameter count.
Comparison experiments of detection effects under different conditions are shown in Fig. 10. The detection results indicate that the proposed model, while achieving lightweight improvements, maintains detection performance comparable to the baseline model. Moreover, under some extreme conditions, such as low-light environments and large pose angles, its detection performance surpasses that of the baseline model, while remaining on par under strong lighting. This achievement is mainly attributed to the introduction of the WIoU loss function, which enhances the model's generalization ability and enables adaptation to various application scenarios.
To validate the performance of the improved lightweight bobbin detection network model, we compared it with current mainstream object detection algorithms.The experimental results are shown in Table 8.
From the experimental results in Table 8, it can be seen that the Faster-RCNN network model has the highest parameter count and computational load, resulting in a larger model weight file and a lower detection frame rate. Although the SSD network model has a higher detection frame rate than Faster-RCNN, its detection accuracy is lower, making it unsuitable for deployment in embedded mobile devices and industrial environments with limited computational resources. In comparison to these two models, the improved lightweight bobbin detection model based on YOLOv8 proposed in this paper shows superior performance in terms of detection accuracy and parameter count. Compared with the baseline model YOLOv8n, the average detection accuracy is improved by 0.8%, the parameter count is reduced by 0.72 M, and the detection frame rate is increased by 46 FPS, enabling effective bobbin detection in the spinning environment.
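The frame-rate figures translate into per-frame latency with simple arithmetic. The sketch below assumes that the 223 FPS reported for the proposed model and the 46 FPS gain over YOLOv8n come from the same measurement setup, so the baseline rate is inferred rather than read from a table.

```python
def frame_time_ms(fps):
    """Average time spent on one frame, in milliseconds."""
    return 1000.0 / fps


improved_fps = 223                       # reported speed of the proposed model
fps_gain = 46                            # reported gain over the baseline
baseline_fps = improved_fps - fps_gain   # inferred, assuming the same setup
relative_gain = fps_gain / baseline_fps

print(f"baseline: {baseline_fps} FPS, {frame_time_ms(baseline_fps):.2f} ms per frame")
print(f"improved: {improved_fps} FPS, {frame_time_ms(improved_fps):.2f} ms per frame")
print(f"relative speed-up: {relative_gain * 100:.1f}%")
```

Under that assumption, the per-frame time drops from about 5.65 ms to about 4.48 ms, a speed-up of roughly 26%.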

Conclusion
This paper introduces a lightweight and high-precision small model based on an improved YOLOv8 for yarn cone detection in the yarn warehouse-type automatic winding machine, demonstrating excellent performance.
Starting from the perspective of model accuracy optimization, experiments were conducted to enhance the model's detection accuracy. The LSKUnit, featuring a large kernel selection mechanism, effectively improves the model's ability to extract shallow features. Its use of multiple large convolutional kernels enriches the model's output features, providing multi-scale spatial information. Subsequently, the Biformer is employed to reinforce the model's contextual association capabilities, addressing the limited attention to global information in convolutional neural networks. Additionally, the combination of LSKUnit and Biformer with the C2f module in the baseline model is experimentally validated, highlighting the advantages of the improved C2f-LSK and C2f-BRA modules.
For industrial detection applications, with a focus on reducing model deployment costs, the model undergoes lightweight improvements. FasterBlock is proven effective in accelerating both the model's detection and training processes. Consequently, integrating FasterBlock with the C2f module in the Neck section of the baseline network reduces model parameters while leveraging the C2f module's multi-gradient flow concept, allowing PWConv within FasterBlock to use information from different gradients and compensating for the precision loss caused by PConv. GhostConv replaces the regular convolutional modules in the Neck section and employs a simple linear transformation to produce convolution-like effects, reducing model parameters and computational costs. Subsequent experiments demonstrate the effectiveness of the C2f-Faster module and GhostConv in improving detection speed and making the model lightweight without sacrificing accuracy.
Finally, to enhance model generalization and robustness, the WIoU loss function is employed as a replacement for CIoU. The non-monotonic dynamic focusing mechanism in WIoU assigns different gradient weights to training data of varying quality, preventing the model from excessively focusing on challenging or easily identifiable examples and thus improving the model's stability.
Currently, the yarn cone dataset has few camera viewpoints and a relatively limited range of yarn cone and yarn warehouse poses, leading to challenges in detection accuracy under large tilt angles. Furthermore, the dataset includes a limited variety of yarn cone classifications. In future work, we plan to supplement the dataset with images capturing yarn cones and yarn warehouses from various angles and poses, enhance yarn cone color classification, and introduce residue classification to further refine the yarn cone detection algorithm serving the yarn warehouse-type automatic winding machine.

Figure 1. The improved model architecture diagram. The concatenation operation is marked by its own symbol in the diagram.

Figure 4. Biformer module structure diagram. In the diagram, the concatenation operation is marked by its own symbol, and DWConv represents the depthwise convolution module.

Figure 5. Schematic diagram of the bi-level routing attention principle. In the diagram, matrix multiplication is marked by its own symbol.

Table 1.

Figure 7. Annotation example of the self-made bobbin dataset. Bobbins inside the bobbin rack are annotated as "full," and positions without bobbins are annotated as "empty."

Figure 8. Visualization comparison of heatmap data under the influence of different attention mechanisms.

Figure 9. Function curves of the focusing coefficient under different hyperparameter selections.

Table 3. Performance improvement experiment results.

Table 4. Lightweight improvement experimental results.

include challenges like varying distances and aspect ratios. The use of geometric metrics intensifies the penalties for low-quality examples, leading to a decline in the model's generalization performance. An effective bounding box regression loss function should aim to avoid excessive focus on low-quality training data, thereby minimizing interference with the training process and enhancing the model's generalization performance.

Table 6. Experimental results of the loss function improvement.

To demonstrate the effectiveness of the improved network in enhancing the model's performance on the yarn dataset, this section employs an ablation study to evaluate the performance outcomes of the model enhancements. The experiment includes five variables: the C2f-LSK module, the C2f-BRA module, the C2f-Faster module, GhostConv, and the WIoU loss function. The evaluation metrics used are average precision, model weight size, parameter count, and inference speed. The results of the ablation study on the improved model proposed in this paper are presented in Table 7 below:

Scientific Reports | (2024) 14:16136 | https://doi.org/10.1038/s41598-024-67196-2

Table 7. Results of ablation experiment.

Table 8. Comparative experimental results of different models.